{
 "cells": [
  {
   "attachments": {},
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# MovieLens数据集实现协同过滤\n",
    "## 协同过滤概述\n",
    "\n",
    "协同过滤（Collaborative Filtering）推荐算法是最经典、最常用的推荐算法。\n",
    "\n",
    "协同过滤主要包括基于用户相似度的推荐(UserCF)和基于物品相似度推荐(ItemCF)。但是在生活实际中，一个用户一般来说只与少部分物品和少部分用户有关联，所以共现矩阵是稀疏的。此时我们要用已有的部分稀疏数据来预测那些空白的物品和数据之间的评分关系，找到最高评分的物品推荐给用户。\n",
    "\n",
    "- **基于用户(user-based)的协同过滤**\n",
    "    - **相似的用户可能喜欢相同物品**。比如用户A和用户B有共同的兴趣或者行为，现在要给用户B推荐物品，就可以把用户A喜欢的物品，并且用户B没看过的物品，推荐给用户B。\n",
    "- **基于物品(item-based)的协同过滤**\n",
    "    - **相似的物品可能被同个用户喜欢**。世界杯期间沃尔玛尿布和啤酒的故事。在世界杯期间，奶爸要喝啤酒看球，又要带娃，啤酒和尿布同时被奶爸所需要，也就是相似商品，可以放在一起销售。本来啤酒和尿布在生活中，并没什么相关性，是通过用户的行为来判断相关性的。\n",
    "- **基于模型(model based)的协同过滤**\n",
    "    - 使用矩阵分解模型来学习用户和物品的协同过滤信息。一般这种协同过滤模型有：SVD，SVD++等。\n",
    "\n",
    "## 实现协同过滤，需要几个步骤：\n",
    "\n",
    "1. 收集用户偏好(评分，浏览记录，购买记录等)\n",
    "2. 找到相似的用户或物品\n",
    "3. 计算并排序\n",
    "\n",
    "## 评估指标\n",
    "\n",
    "比如，你给用户推10个物品，但是只推了5个是用户想要的(正确的)，准确率就是<font size=4 width=\"50%\" height=\"50%\">$\\frac{5}{10}$</font>，而这个用户实际想要的有20个，但是只命中了5个，所以召回率就是<font size=4 width=\"50%\" height=\"50%\">$\\frac{5}{20}$<font>。\n",
    "\n",
    "R(u)是用户u的推荐列表(根据用户在训练集上的行为给用户预测的物品集合)，T(u)是用户在测试集上的行为列表\n",
    "\n",
    "### 准确率\n",
    "准确率：推荐对的物品数量占召回总数量比例\n",
    "<div align=center><font size=5 width=\"50%\" height=\"50%\">$Presicion=\\frac{\\sum_{u \\in U}|R(u)| \\cap |T(u)|}{\\sum_{u \\in U}|R(u)|}$</font></div>\n",
    "\n",
    "### 召回率\n",
    "召回率：推荐对的物品数量占总物品数量的比例\n",
    "<div align=center><font size=5 width=\"50%\" height=\"50%\">$Recall=\\frac{\\sum_{u \\in U}|R(u)| \\cap |T(u)|}{\\sum_{u \\in U}|T(u)|}$</font></div>\n",
    "\n",
    "# 用户协同\n",
    "## 目标\n",
    "1. 找到目标用户A相似的K个用户\n",
    "2. 将相似的K个用户曾经看过的物品，并且目标用户A没有买过或浏览过的物品推荐给目标用户A\n",
    "\n",
    "## 用户相似度计算\n",
    "假设给定用户u和用户v，N(u)是用户u曾经看过的物品集合，N(v)是用户v曾经看过的用户集合\n",
    "![image.png](用户-物品行为记录.png)\n",
    "\n",
    "jaccard计算用户u和用户v相似度\n",
    "<div align=center><font size=5 width=\"50%\" height=\"50%\">$W_{uv}=\\frac{|N(u)\\cap N(v)|}{|N(u)\\cup N(v)|}$</font></div>\n",
    "\n",
    "余弦计算相似度\n",
    "<div align=center><font size=5 width=\"50%\" height=\"50%\">$W_{uv}=\\frac{|N(u)\\cap N(v)|}{\\sqrt|N(u)||N(v)|}$</font></div>\n",
    "\n",
    "伪代码：\n",
    "```python\n",
    "def UserSimilarity(train):\n",
    "    W = dict()\n",
    "    for u in train.keys():\n",
    "        for v in train.keys():\n",
    "            if u == v:\n",
    "                continue\n",
    "            W[u][v] = len(train[u] & train[v])\n",
    "            W[u][v] /= math.sqrt(len(train[u]) * len(train[v]) * 1.0)\n",
    "    return W\n",
    "```\n",
    "\n",
    "<font size=4>**思考：这么计算有没有什么问题？**</font>\n",
    "\n",
    "1. 每次计算，都要遍历一遍用户表，时间复杂度是$O(n^2)$\n",
    "2. 用户和用户计算相似度过程中，假如用户之间没有共同看过的物品的，也就是上面的交集为空，这种情况也会被保存\n",
    "\n",
    "<font size=4>**解决**</font>\n",
    "\n",
    "既然有很多用户没有打分，并且这个计算复杂度这么高，我们可不可以换一种思路，上面那是用户-物品表，换成建立物品-用户的倒排表。\n",
    "\n",
    "![image.png](物品用户倒排表.png)\n",
    "\n",
    "<font size=4>**物品-用户倒排表好处？**</font>\n",
    "\n",
    "1. 用户没有过行为的，直接被过滤了，最终我们得到的都是有行为的表，计算上减少了，时间复杂度也降低了\n",
    "2. 将计算的用户u和用户v的相似度矩阵存储起来，大大减少计算量\n",
    "\n",
    "## 给用户推荐和他最相似的K个用户喜欢的物品\n",
    "用户对物品i感兴趣程度\n",
    "<div align=center><font size=5 width=\"50%\" height=\"50%\">$p(u,i)=\\sum_{v \\in S(u,K) \\cap N(i)}W_{uv}r_{vi}$</font></div>\n",
    "$S(u,K)$ 包含和用户$u$兴趣最接近的$K$个用户，$N(i)$是对物品$i$有过行为的用户集合，$W_{uv}$是用户$u$和用户$v$的兴趣相似度，$r_{vi}$是用户$v$对物品的感兴趣程度\n",
    "\n",
    "![image.png](用户对物品i的相似度.png)\n",
    "\n",
    "\n",
    "## 用户协同代码实现"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 61,
   "metadata": {
    "ExecuteTime": {
     "end_time": "2020-08-09T13:00:13.789027Z",
     "start_time": "2020-08-09T13:00:13.766176Z"
    }
   },
   "outputs": [],
   "source": [
    "import os\n",
    "import math\n",
    "import random\n",
    "from pandas import DataFrame\n",
    "class UserBasedCF:\n",
    "    def __init__(self, path):\n",
    "        self.train = {} #用户-物品的评分表 训练集\n",
    "        self.test = {} #用户-物品的评分表 测试集\n",
    "        self.generate_dataset(path)\n",
    "\n",
    "    def loadfile(self, path):\n",
    "        with open(path, 'r', encoding='utf-8') as fp:\n",
    "            for i, line in enumerate(fp):\n",
    "                yield line.strip('\\r\\n')\n",
    "\n",
    "    \n",
    "    def generate_dataset(self, path, pivot=0.7):\n",
    "        #读取文件，并生成用户-物品的评分表和测试集\n",
    "        i = 0\n",
    "        for line in self.loadfile(path):\n",
    "            user, movie, rating, _ = line.split('::')\n",
    "            if i <= 10:\n",
    "                print('{},{},{},{}'.format(user, movie, rating, _))\n",
    "            i += 1\n",
    "            if random.random() < pivot:\n",
    "                self.train.setdefault(user, {})\n",
    "                self.train[user][movie] = int(rating)\n",
    "            else:\n",
    "                self.test.setdefault(user, {})\n",
    "                self.test[user][movie] = int(rating)\n",
    "\n",
    "\n",
    "    def UserSimilarity(self):\n",
    "        #建立物品-用户的倒排表\n",
    "        self.item_users = dict()\n",
    "        for user,items in self.train.items():\n",
    "            for i in items.keys():\n",
    "                if i not in self.item_users:\n",
    "                    self.item_users[i] = set()\n",
    "                self.item_users[i].add(user)\n",
    "\n",
    "        #计算用户-用户共现矩阵\n",
    "        C = dict()  #用户-用户共现矩阵\n",
    "        N = dict()  #用户产生行为的物品个数\n",
    "        for i,users in self.item_users.items():\n",
    "            for u in users:\n",
    "                N.setdefault(u,0)\n",
    "                N[u] += 1\n",
    "                C.setdefault(u,{})\n",
    "                for v in users:\n",
    "                    if u == v:\n",
    "                        continue\n",
    "                    C[u].setdefault(v,0)\n",
    "                    C[u][v] += 1\n",
    "\n",
    "        #计算用户-用户相似度，余弦相似度\n",
    "        self.W = dict()      #相似度矩阵\n",
    "        for u,related_users in C.items():\n",
    "            self.W.setdefault(u,{})\n",
    "            for v,cuv in related_users.items():\n",
    "                self.W[u][v] = cuv / math.sqrt(N[u] * N[v])\n",
    "        return self.W, C, N\n",
    "\n",
    "    #给用户user推荐，前K个相关用户\n",
    "    def Recommend(self,u,K=3,N=10):\n",
    "        rank = dict()\n",
    "        action_item = self.train[u].keys()     #用户user产生过行为的item\n",
    "        # v: 用户v\n",
    "        # wuv：用户u和用户v的相似度\n",
    "        for v,wuv in sorted(self.W[u].items(),key=lambda x:x[1],reverse=True)[0:K]:\n",
    "            #遍历前K个与user最相关的用户\n",
    "            # i：用户v有过行为的物品i\n",
    "            # rvi：用户v对物品i的打分\n",
    "            for i,rvi in self.train[v].items():\n",
    "                if i in action_item:\n",
    "                    continue\n",
    "                rank.setdefault(i,0)\n",
    "                # 用户对物品的感兴趣程度：用户u和用户v的相似度*用户v对物品i的打分\n",
    "                rank[i] += wuv * rvi\n",
    "        return dict(sorted(rank.items(),key=lambda x:x[1],reverse=True)[0:N])   #推荐结果的取前N个\n",
    "    \n",
    "    # 计算召回率和准确率\n",
    "    def recallAndPrecision(self,k=8,nitem=10):\n",
    "        hit = 0\n",
    "        recall = 0\n",
    "        precision = 0\n",
    "        for user, items in self.test.items():\n",
    "            rank = self.Recommend(user, K=k, N=nitem)\n",
    "            hit += len(set(rank.keys()) & set(items.keys()))\n",
    "            recall += len(items)\n",
    "            precision += nitem\n",
    "        return (hit / (recall * 1.0),hit / (precision * 1.0))"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 62,
   "metadata": {
    "ExecuteTime": {
     "end_time": "2020-08-09T13:00:15.330162Z",
     "start_time": "2020-08-09T13:00:15.321232Z"
    }
   },
   "outputs": [],
   "source": [
    "def print_2_dim_dic(dic, n=3):\n",
    "    n = 0\n",
    "    for u,v_cnt in dic.items():\n",
    "        if n >= 3:\n",
    "            break\n",
    "        n += 1    \n",
    "        m = 1\n",
    "        for v, cnt in v_cnt.items():\n",
    "            if m >= 3:\n",
    "                break\n",
    "            m += 1\n",
    "            print(u, v, cnt)\n",
    "\n",
    "def print_1_dim_dic(dic, n=3):\n",
    "    n = 0\n",
    "    for u,i_cnt in dic.items():\n",
    "        if n >= 3:\n",
    "            break\n",
    "        n += 1    \n",
    "        print(u, i_cnt)\n",
    "\n",
    "def sort_2_dim_dic(dic, k, n=5):\n",
    "    return sorted(dic[k].items(), key=lambda x:x[1],reverse=True)[:n]\n",
    "\n",
    "def sort_1_dim_dic(dic, n=5):\n",
    "    return sorted(dic.items(), key=lambda x:x[1],reverse=True)[:n]\n",
    "\n",
    "def trans_dic_2_matrix(dic):\n",
    "     return DataFrame(dic).T.fillna(0)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 63,
   "metadata": {
    "ExecuteTime": {
     "end_time": "2020-08-09T13:00:18.979850Z",
     "start_time": "2020-08-09T13:00:17.232651Z"
    }
   },
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "1,1193,5,978300760\n",
      "1,661,3,978302109\n",
      "1,914,3,978301968\n",
      "1,3408,4,978300275\n",
      "1,2355,5,978824291\n",
      "1,1197,3,978302268\n",
      "1,1287,5,978302039\n",
      "1,2804,5,978300719\n",
      "1,594,4,978302268\n",
      "1,919,4,978301368\n",
      "1,595,5,978824268\n"
     ]
    }
   ],
   "source": [
    "# user, movie, rating, _\n",
    "path = os.path.join('ml-1m', 'ratings.dat')\n",
    "ucf = UserBasedCF(path)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 33,
   "metadata": {
    "ExecuteTime": {
     "end_time": "2020-08-09T11:23:50.147796Z",
     "start_time": "2020-08-09T11:21:27.382128Z"
    }
   },
   "outputs": [],
   "source": [
    "W,C,N = ucf.UserSimilarity()"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 34,
   "metadata": {
    "ExecuteTime": {
     "end_time": "2020-08-09T11:24:07.876691Z",
     "start_time": "2020-08-09T11:23:50.149971Z"
    }
   },
   "outputs": [],
   "source": [
    "# 用户共现矩阵\n",
    "df_c = trans_dic_2_matrix(C)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 35,
   "metadata": {
    "ExecuteTime": {
     "end_time": "2020-08-09T11:24:07.884375Z",
     "start_time": "2020-08-09T11:24:07.879237Z"
    }
   },
   "outputs": [
    {
     "data": {
      "text/plain": [
       "(6040, 6040)"
      ]
     },
     "execution_count": 35,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "df_c.shape"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 36,
   "metadata": {
    "ExecuteTime": {
     "end_time": "2020-08-09T11:24:07.910649Z",
     "start_time": "2020-08-09T11:24:07.886585Z"
    }
   },
   "outputs": [
    {
     "data": {
      "text/html": [
       "<div>\n",
       "<style scoped>\n",
       "    .dataframe tbody tr th:only-of-type {\n",
       "        vertical-align: middle;\n",
       "    }\n",
       "\n",
       "    .dataframe tbody tr th {\n",
       "        vertical-align: top;\n",
       "    }\n",
       "\n",
       "    .dataframe thead th {\n",
       "        text-align: right;\n",
       "    }\n",
       "</style>\n",
       "<table border=\"1\" class=\"dataframe\">\n",
       "  <thead>\n",
       "    <tr style=\"text-align: right;\">\n",
       "      <th></th>\n",
       "      <th>3715</th>\n",
       "      <th>5294</th>\n",
       "      <th>5539</th>\n",
       "      <th>1266</th>\n",
       "      <th>5114</th>\n",
       "      <th>5925</th>\n",
       "      <th>1851</th>\n",
       "      <th>4585</th>\n",
       "      <th>80</th>\n",
       "      <th>2544</th>\n",
       "    </tr>\n",
       "  </thead>\n",
       "  <tbody>\n",
       "    <tr>\n",
       "      <th>715</th>\n",
       "      <td>16.0</td>\n",
       "      <td>11.0</td>\n",
       "      <td>35.0</td>\n",
       "      <td>31.0</td>\n",
       "      <td>13.0</td>\n",
       "      <td>21.0</td>\n",
       "      <td>28.0</td>\n",
       "      <td>16.0</td>\n",
       "      <td>4.0</td>\n",
       "      <td>36.0</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>3715</th>\n",
       "      <td>0.0</td>\n",
       "      <td>38.0</td>\n",
       "      <td>35.0</td>\n",
       "      <td>57.0</td>\n",
       "      <td>39.0</td>\n",
       "      <td>35.0</td>\n",
       "      <td>67.0</td>\n",
       "      <td>65.0</td>\n",
       "      <td>10.0</td>\n",
       "      <td>54.0</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>5294</th>\n",
       "      <td>38.0</td>\n",
       "      <td>0.0</td>\n",
       "      <td>32.0</td>\n",
       "      <td>30.0</td>\n",
       "      <td>20.0</td>\n",
       "      <td>24.0</td>\n",
       "      <td>38.0</td>\n",
       "      <td>27.0</td>\n",
       "      <td>4.0</td>\n",
       "      <td>40.0</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>5539</th>\n",
       "      <td>35.0</td>\n",
       "      <td>32.0</td>\n",
       "      <td>0.0</td>\n",
       "      <td>85.0</td>\n",
       "      <td>21.0</td>\n",
       "      <td>52.0</td>\n",
       "      <td>46.0</td>\n",
       "      <td>26.0</td>\n",
       "      <td>7.0</td>\n",
       "      <td>157.0</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>1266</th>\n",
       "      <td>57.0</td>\n",
       "      <td>30.0</td>\n",
       "      <td>85.0</td>\n",
       "      <td>0.0</td>\n",
       "      <td>35.0</td>\n",
       "      <td>48.0</td>\n",
       "      <td>115.0</td>\n",
       "      <td>63.0</td>\n",
       "      <td>21.0</td>\n",
       "      <td>109.0</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>5114</th>\n",
       "      <td>39.0</td>\n",
       "      <td>20.0</td>\n",
       "      <td>21.0</td>\n",
       "      <td>35.0</td>\n",
       "      <td>0.0</td>\n",
       "      <td>20.0</td>\n",
       "      <td>23.0</td>\n",
       "      <td>22.0</td>\n",
       "      <td>7.0</td>\n",
       "      <td>33.0</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>5925</th>\n",
       "      <td>35.0</td>\n",
       "      <td>24.0</td>\n",
       "      <td>52.0</td>\n",
       "      <td>48.0</td>\n",
       "      <td>20.0</td>\n",
       "      <td>0.0</td>\n",
       "      <td>27.0</td>\n",
       "      <td>32.0</td>\n",
       "      <td>8.0</td>\n",
       "      <td>60.0</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>1851</th>\n",
       "      <td>67.0</td>\n",
       "      <td>38.0</td>\n",
       "      <td>46.0</td>\n",
       "      <td>115.0</td>\n",
       "      <td>23.0</td>\n",
       "      <td>27.0</td>\n",
       "      <td>0.0</td>\n",
       "      <td>65.0</td>\n",
       "      <td>9.0</td>\n",
       "      <td>87.0</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>4585</th>\n",
       "      <td>65.0</td>\n",
       "      <td>27.0</td>\n",
       "      <td>26.0</td>\n",
       "      <td>63.0</td>\n",
       "      <td>22.0</td>\n",
       "      <td>32.0</td>\n",
       "      <td>65.0</td>\n",
       "      <td>0.0</td>\n",
       "      <td>12.0</td>\n",
       "      <td>59.0</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>80</th>\n",
       "      <td>10.0</td>\n",
       "      <td>4.0</td>\n",
       "      <td>7.0</td>\n",
       "      <td>21.0</td>\n",
       "      <td>7.0</td>\n",
       "      <td>8.0</td>\n",
       "      <td>9.0</td>\n",
       "      <td>12.0</td>\n",
       "      <td>0.0</td>\n",
       "      <td>14.0</td>\n",
       "    </tr>\n",
       "  </tbody>\n",
       "</table>\n",
       "</div>"
      ],
      "text/plain": [
       "      3715  5294  5539   1266  5114  5925   1851  4585    80   2544\n",
       "715   16.0  11.0  35.0   31.0  13.0  21.0   28.0  16.0   4.0   36.0\n",
       "3715   0.0  38.0  35.0   57.0  39.0  35.0   67.0  65.0  10.0   54.0\n",
       "5294  38.0   0.0  32.0   30.0  20.0  24.0   38.0  27.0   4.0   40.0\n",
       "5539  35.0  32.0   0.0   85.0  21.0  52.0   46.0  26.0   7.0  157.0\n",
       "1266  57.0  30.0  85.0    0.0  35.0  48.0  115.0  63.0  21.0  109.0\n",
       "5114  39.0  20.0  21.0   35.0   0.0  20.0   23.0  22.0   7.0   33.0\n",
       "5925  35.0  24.0  52.0   48.0  20.0   0.0   27.0  32.0   8.0   60.0\n",
       "1851  67.0  38.0  46.0  115.0  23.0  27.0    0.0  65.0   9.0   87.0\n",
       "4585  65.0  27.0  26.0   63.0  22.0  32.0   65.0   0.0  12.0   59.0\n",
       "80    10.0   4.0   7.0   21.0   7.0   8.0    9.0  12.0   0.0   14.0"
      ]
     },
     "execution_count": 36,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "df_c.iloc[:10,:10]"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 37,
   "metadata": {
    "ExecuteTime": {
     "end_time": "2020-08-09T11:24:07.919067Z",
     "start_time": "2020-08-09T11:24:07.912775Z"
    }
   },
   "outputs": [
    {
     "data": {
      "text/plain": [
       "[('1088', 28), ('2073', 27), ('678', 26), ('2529', 25), ('5795', 25)]"
      ]
     },
     "execution_count": 37,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "sort_2_dim_dic(C, '1')"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 38,
   "metadata": {
    "ExecuteTime": {
     "end_time": "2020-08-09T11:24:07.928488Z",
     "start_time": "2020-08-09T11:24:07.923627Z"
    }
   },
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "715 71\n",
      "3715 222\n",
      "5294 109\n"
     ]
    }
   ],
   "source": [
    "# 用户产生行为的物品个数\n",
    "print_1_dim_dic(N)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 39,
   "metadata": {
    "ExecuteTime": {
     "end_time": "2020-08-09T11:24:07.937093Z",
     "start_time": "2020-08-09T11:24:07.931093Z"
    }
   },
   "outputs": [
    {
     "data": {
      "text/plain": [
       "[('4169', 1635), ('1680', 1306), ('4277', 1230), ('1941', 1118), ('889', 1091)]"
      ]
     },
     "execution_count": 39,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "sort_1_dim_dic(N)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 40,
   "metadata": {
    "ExecuteTime": {
     "end_time": "2020-08-09T11:24:23.110182Z",
     "start_time": "2020-08-09T11:24:07.938934Z"
    }
   },
   "outputs": [],
   "source": [
    "# 用户和用户相似度矩阵\n",
    "df_w = trans_dic_2_matrix(W)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 41,
   "metadata": {
    "ExecuteTime": {
     "end_time": "2020-08-09T11:24:23.116836Z",
     "start_time": "2020-08-09T11:24:23.112366Z"
    }
   },
   "outputs": [
    {
     "data": {
      "text/plain": [
       "(6040, 6040)"
      ]
     },
     "execution_count": 41,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "df_w.shape"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 42,
   "metadata": {
    "ExecuteTime": {
     "end_time": "2020-08-09T11:24:23.136002Z",
     "start_time": "2020-08-09T11:24:23.118611Z"
    }
   },
   "outputs": [
    {
     "data": {
      "text/html": [
       "<div>\n",
       "<style scoped>\n",
       "    .dataframe tbody tr th:only-of-type {\n",
       "        vertical-align: middle;\n",
       "    }\n",
       "\n",
       "    .dataframe tbody tr th {\n",
       "        vertical-align: top;\n",
       "    }\n",
       "\n",
       "    .dataframe thead th {\n",
       "        text-align: right;\n",
       "    }\n",
       "</style>\n",
       "<table border=\"1\" class=\"dataframe\">\n",
       "  <thead>\n",
       "    <tr style=\"text-align: right;\">\n",
       "      <th></th>\n",
       "      <th>3715</th>\n",
       "      <th>5294</th>\n",
       "      <th>5539</th>\n",
       "      <th>1266</th>\n",
       "      <th>5114</th>\n",
       "      <th>5925</th>\n",
       "      <th>1851</th>\n",
       "      <th>4585</th>\n",
       "      <th>80</th>\n",
       "      <th>2544</th>\n",
       "    </tr>\n",
       "  </thead>\n",
       "  <tbody>\n",
       "    <tr>\n",
       "      <th>715</th>\n",
       "      <td>0.127443</td>\n",
       "      <td>0.125040</td>\n",
       "      <td>0.234409</td>\n",
       "      <td>0.162910</td>\n",
       "      <td>0.149852</td>\n",
       "      <td>0.212927</td>\n",
       "      <td>0.177368</td>\n",
       "      <td>0.122570</td>\n",
       "      <td>0.076015</td>\n",
       "      <td>0.187901</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>3715</th>\n",
       "      <td>0.000000</td>\n",
       "      <td>0.244283</td>\n",
       "      <td>0.132564</td>\n",
       "      <td>0.169400</td>\n",
       "      <td>0.254235</td>\n",
       "      <td>0.200693</td>\n",
       "      <td>0.240019</td>\n",
       "      <td>0.281599</td>\n",
       "      <td>0.107471</td>\n",
       "      <td>0.159394</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>5294</th>\n",
       "      <td>0.244283</td>\n",
       "      <td>0.000000</td>\n",
       "      <td>0.172970</td>\n",
       "      <td>0.127240</td>\n",
       "      <td>0.186065</td>\n",
       "      <td>0.196398</td>\n",
       "      <td>0.194275</td>\n",
       "      <td>0.166934</td>\n",
       "      <td>0.061350</td>\n",
       "      <td>0.168501</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>5539</th>\n",
       "      <td>0.132564</td>\n",
       "      <td>0.172970</td>\n",
       "      <td>0.000000</td>\n",
       "      <td>0.212407</td>\n",
       "      <td>0.115107</td>\n",
       "      <td>0.250714</td>\n",
       "      <td>0.138561</td>\n",
       "      <td>0.094712</td>\n",
       "      <td>0.063256</td>\n",
       "      <td>0.389663</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>1266</th>\n",
       "      <td>0.169400</td>\n",
       "      <td>0.127240</td>\n",
       "      <td>0.212407</td>\n",
       "      <td>0.000000</td>\n",
       "      <td>0.150532</td>\n",
       "      <td>0.181592</td>\n",
       "      <td>0.271806</td>\n",
       "      <td>0.180074</td>\n",
       "      <td>0.148902</td>\n",
       "      <td>0.212274</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>5114</th>\n",
       "      <td>0.254235</td>\n",
       "      <td>0.186065</td>\n",
       "      <td>0.115107</td>\n",
       "      <td>0.150532</td>\n",
       "      <td>0.000000</td>\n",
       "      <td>0.165965</td>\n",
       "      <td>0.119240</td>\n",
       "      <td>0.137932</td>\n",
       "      <td>0.108871</td>\n",
       "      <td>0.140966</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>5925</th>\n",
       "      <td>0.200693</td>\n",
       "      <td>0.196398</td>\n",
       "      <td>0.250714</td>\n",
       "      <td>0.181592</td>\n",
       "      <td>0.165965</td>\n",
       "      <td>0.000000</td>\n",
       "      <td>0.123126</td>\n",
       "      <td>0.176475</td>\n",
       "      <td>0.109445</td>\n",
       "      <td>0.225448</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>1851</th>\n",
       "      <td>0.240019</td>\n",
       "      <td>0.194275</td>\n",
       "      <td>0.138561</td>\n",
       "      <td>0.271806</td>\n",
       "      <td>0.119240</td>\n",
       "      <td>0.123126</td>\n",
       "      <td>0.000000</td>\n",
       "      <td>0.223952</td>\n",
       "      <td>0.076923</td>\n",
       "      <td>0.204230</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>4585</th>\n",
       "      <td>0.281599</td>\n",
       "      <td>0.166934</td>\n",
       "      <td>0.094712</td>\n",
       "      <td>0.180074</td>\n",
       "      <td>0.137932</td>\n",
       "      <td>0.176475</td>\n",
       "      <td>0.223952</td>\n",
       "      <td>0.000000</td>\n",
       "      <td>0.124035</td>\n",
       "      <td>0.167495</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>80</th>\n",
       "      <td>0.107471</td>\n",
       "      <td>0.061350</td>\n",
       "      <td>0.063256</td>\n",
       "      <td>0.148902</td>\n",
       "      <td>0.108871</td>\n",
       "      <td>0.109445</td>\n",
       "      <td>0.076923</td>\n",
       "      <td>0.124035</td>\n",
       "      <td>0.000000</td>\n",
       "      <td>0.098594</td>\n",
       "    </tr>\n",
       "  </tbody>\n",
       "</table>\n",
       "</div>"
      ],
      "text/plain": [
       "          3715      5294      5539      1266      5114      5925      1851  \\\n",
       "715   0.127443  0.125040  0.234409  0.162910  0.149852  0.212927  0.177368   \n",
       "3715  0.000000  0.244283  0.132564  0.169400  0.254235  0.200693  0.240019   \n",
       "5294  0.244283  0.000000  0.172970  0.127240  0.186065  0.196398  0.194275   \n",
       "5539  0.132564  0.172970  0.000000  0.212407  0.115107  0.250714  0.138561   \n",
       "1266  0.169400  0.127240  0.212407  0.000000  0.150532  0.181592  0.271806   \n",
       "5114  0.254235  0.186065  0.115107  0.150532  0.000000  0.165965  0.119240   \n",
       "5925  0.200693  0.196398  0.250714  0.181592  0.165965  0.000000  0.123126   \n",
       "1851  0.240019  0.194275  0.138561  0.271806  0.119240  0.123126  0.000000   \n",
       "4585  0.281599  0.166934  0.094712  0.180074  0.137932  0.176475  0.223952   \n",
       "80    0.107471  0.061350  0.063256  0.148902  0.108871  0.109445  0.076923   \n",
       "\n",
       "          4585        80      2544  \n",
       "715   0.122570  0.076015  0.187901  \n",
       "3715  0.281599  0.107471  0.159394  \n",
       "5294  0.166934  0.061350  0.168501  \n",
       "5539  0.094712  0.063256  0.389663  \n",
       "1266  0.180074  0.148902  0.212274  \n",
       "5114  0.137932  0.108871  0.140966  \n",
       "5925  0.176475  0.109445  0.225448  \n",
       "1851  0.223952  0.076923  0.204230  \n",
       "4585  0.000000  0.124035  0.167495  \n",
       "80    0.124035  0.000000  0.098594  "
      ]
     },
     "execution_count": 42,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "df_w.iloc[:10, :10]"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 43,
   "metadata": {
    "ExecuteTime": {
     "end_time": "2020-08-09T11:24:23.150497Z",
     "start_time": "2020-08-09T11:24:23.139950Z"
    }
   },
   "outputs": [
    {
     "data": {
      "text/plain": [
       "[('1383', 0.3173581411309744),\n",
       " ('1422', 0.3078467709206799),\n",
       " ('1988', 0.30703219093409484),\n",
       " ('4834', 0.3060561195977717),\n",
       " ('1051', 0.3047032960847922)]"
      ]
     },
     "execution_count": 43,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "sort_2_dim_dic(W, '520')"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 44,
   "metadata": {
    "ExecuteTime": {
     "end_time": "2020-08-09T11:24:23.164331Z",
     "start_time": "2020-08-09T11:24:23.154737Z"
    }
   },
   "outputs": [
    {
     "data": {
      "text/plain": [
       "{'1307': 4.354153323994651,\n",
       " '1954': 4.354153323994651,\n",
       " '2997': 4.046306553073971,\n",
       " '1956': 4.045491973087386,\n",
       " '3198': 4.036795182863676,\n",
       " '1097': 4.036795182863676,\n",
       " '1225': 4.035980602877091,\n",
       " '2797': 3.7392743621398763,\n",
       " '908': 3.7384597821532912,\n",
       " '953': 3.7384597821532912}"
      ]
     },
     "execution_count": 44,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "recomend = ucf.Recommend('520')\n",
    "recomend"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 45,
   "metadata": {
    "ExecuteTime": {
     "end_time": "2020-08-09T11:24:37.644295Z",
     "start_time": "2020-08-09T11:24:23.167227Z"
    }
   },
   "outputs": [
    {
     "data": {
      "text/plain": [
       "(0.06939630584679639, 0.3444205298013245)"
      ]
     },
     "execution_count": 45,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "ucf.recallAndPrecision()"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## 用户相似度的改进\n",
    "\n",
    "余弦计算相似度\n",
    "<div align=center><font size=5 width=\"50%\" height=\"50%\">$W_{uv}=\\frac{|N(u)\\cap N(v)|}{\\sqrt|N(u)||N(v)|}$</font></div>\n",
    "\n",
    "以图书为例，两个用户都曾经购买过《新华字典》，这丝毫不能说明他们兴趣相似，因为绝大部分中国人小时候都买过《新华字典》，但如果两个用户都买过《数据挖掘导论》，那可以认为他们的兴趣相似，因为只有研究过数据挖掘的人才会买这本书。\n",
    "\n",
    "换句话说，两个用户对冷门物品采取同样的行为，更能说明他们兴趣的相似度。因此，John S.Breese 在论文中提出了如下公式，根据用户行为计算用户兴趣相似度：\n",
    "\n",
    "<div align=center><font size=5 width=\"50%\" height=\"50%\">$W_{uv}=\\frac{\\sum_{i \\in N(u) \\cap N(v) \\frac{1}{log1+|N(i)|}}}{\\sqrt|N(u)||N(v)|}$</font></div>\n",
    "\n",
    "该公式通过 <font size=5 width=\"50%\" height=\"50%\">$\\frac{1}{log1+|N(i)|}$</font> 惩罚了用户u和用户v共同兴趣列表中热门物品对他们相似度的影响。𝑁(𝑖) 是对物品𝑖有过行为的用户集合"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 46,
   "metadata": {
    "ExecuteTime": {
     "end_time": "2020-08-09T11:24:37.654492Z",
     "start_time": "2020-08-09T11:24:37.646066Z"
    }
   },
   "outputs": [],
   "source": [
    "def UserSimilarity(self):\n",
    "    #建立物品-用户的倒排表\n",
    "    self.item_users = dict()\n",
    "    for user,items in self.train.items():\n",
    "        for i in items.keys():\n",
    "            if i not in self.item_users:\n",
    "                self.item_users[i] = set()\n",
    "            self.item_users[i].add(user)\n",
    "\n",
    "    #计算用户-用户共现矩阵\n",
    "    C = dict()  #用户-用户共现矩阵\n",
    "    N = dict()  #用户产生行为的物品个数\n",
    "    for i,users in self.item_users.items():\n",
    "        for u in users:\n",
    "            N.setdefault(u,0)\n",
    "            N[u] += 1\n",
    "            C.setdefault(u,{})\n",
    "            for v in users:\n",
    "                if u == v:\n",
    "                    continue\n",
    "                C[u].setdefault(v,0)\n",
    "                C[u][v] += 1 / math.log(1+len(u))\n",
    "\n",
    "    #计算用户-用户相似度，余弦相似度\n",
    "    self.W = dict()      #相似度矩阵\n",
    "    for u,related_users in C.items():\n",
    "        self.W.setdefault(u,{})\n",
    "        for v,cuv in related_users.items():\n",
    "            self.W[u][v] = cuv / math.sqrt(N[u] * N[v])\n",
    "    return self.W, C, N"
   ]
  },
  {
   "attachments": {},
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# 物品协同\n",
    "给用户推荐那些和他们之前喜欢的物品相似的物品。比如：该算法会因为你购买过《数据挖掘导论》而给你推荐《机器学习》。\n",
    "\n",
    "ItemCF并不是利用物品内容属性计算物品相似度，而是通过分析用户的行为记录计算物品之间的相似度。\n",
    "\n",
    "该算法认为物品A和物品B之间具有很大的相似度是因为用喜欢物品A的用户大都喜欢物品B。\n",
    "\n",
    "## 目标\n",
    "1. 计算物品之间的相似度。\n",
    "2. 根据物品的相似度和用户的历史行为给用户生成推荐列表。\n",
    "\n",
    "## 物品相似度\n",
    "可以理解为：喜欢物品i的用户中有多少比例的用户也喜欢物品j\n",
    "<div align=center><font size=5 width=\"50%\" height=\"50%\">$W_{ij}=\\frac{|N(i) \\cap N(j)|}{|N(i)|}$</font></div>\n",
    "\n",
    "$|N(i) \\cap N(j)|$是同时喜欢物品i和物品j的用户数，$N(i)$是喜欢物品i的用户数\n",
    "\n",
    "<font size=4>**思考：如果物品j很热门，很多人都喜欢，那么$w_{ij}$就会很大，接近1。**</font>\n",
    "\n",
    "因此，该公式会造成任何物品都会和热门物品有很大的相似度，这对于致力于挖掘长尾信息的推荐系统来说显然不是一个好的特性。\n",
    "\n",
    "为避免推荐出热门的物品，可以用下面的公式：\n",
    "\n",
    "<div align=center><font size=5 width=\"50%\" height=\"50%\">$W_{ij}=\\frac{|N(i) \\cap N(j)|}{\\sqrt|N(i)||N(j)|}$</font></div>\n",
    "\n",
    "这个公式惩罚了物品j的权重，因此减轻了热门物品会和很多物品相似的可能性。\n",
    "\n",
    "和UserCF算法类似，ItemCF算法计算物品相似度时，也可以建立用户-物品倒排表(对每个用户建立一个包含他喜欢的物品列表。\n",
    "\n",
    "```python\n",
    "def ItemSimilarity(self):\n",
    "        #建立物品-物品的共现矩阵\n",
    "        C = dict()  #物品-物品的共现矩阵\n",
    "        N = dict()  #物品被多少个不同用户购买\n",
    "        for user,items in self.train.items():\n",
    "            for i in items.keys():\n",
    "                N.setdefault(i,0)\n",
    "                N[i] += 1\n",
    "                C.setdefault(i,{})\n",
    "                for j in items.keys():\n",
    "                    if i == j: \n",
    "                        continue\n",
    "                    C[i].setdefault(j,0)\n",
    "                    C[i][j] += 1\n",
    "        #计算相似度矩阵\n",
    "        self.W = dict()\n",
    "        for i,related_items in C.items():\n",
    "            self.W.setdefault(i,{})\n",
    "            for j,cij in related_items.items():\n",
    "                self.W[i][j] = cij / (math.sqrt(N[i] * N[j]))\n",
    "        return self.W, C, N\n",
    "```\n",
    "\n",
    "最左边是输入的用户行为记录，每一行代表一个用户感兴趣的物品集合。然后，对于每个物品集合，将里面的物品两两加1，得到一个矩阵。最终将这些矩阵相加得到共现矩阵C，其中C[i][j]记录了同时喜欢物品i和物品j的用户数，最后，将C矩阵归一化得到物品之间的余弦相似度矩阵W。\n",
    "![image.png](物品-物品共现矩阵.png)\n",
    "\n",
    "## 给用户推荐和他喜欢的物品列表中最相似的物品\n",
    "计算用户u对物品j感兴趣程度：\n",
    "<div align=center><font size=5 width=\"50%\" height=\"50%\">$P_{uj}=\\sum_{i \\in N(u) \\cap S(j,K)}w_{ji}r_{ui}$</font></div>\n",
    "\n",
    "$N(u)$是用户喜欢的物品的集合，$S(j,K)$是和物品j最相似的K个物品的集合，$w_{ji}$是物品j和物品i的相似度，$r_{ui}$是用户u对物品i的兴趣。\n",
    "\n",
    "![image.png](用户对物品j的感兴趣程度.png)\n",
    "\n",
    "## 物品协同代码实现"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 47,
   "metadata": {
    "ExecuteTime": {
     "end_time": "2020-08-09T11:24:37.673951Z",
     "start_time": "2020-08-09T11:24:37.656832Z"
    }
   },
   "outputs": [],
   "source": [
    "class ItemBasedCF:\n",
    "    def __init__(self, path):\n",
    "        self.train = {} #用户-物品的评分表 训练集\n",
    "        self.test = {} #用户-物品的评分表 测试集\n",
    "        self.generate_dataset(path)\n",
    "\n",
    "    def loadfile(self, path):\n",
    "        with open(path, 'r', encoding='utf-8') as fp:\n",
    "            for i, line in enumerate(fp):\n",
    "                yield line.strip('\\r\\n')\n",
    "\n",
    "    \n",
    "    def generate_dataset(self, path, pivot=0.7):\n",
    "        #读取文件，并生成用户-物品的评分表和测试集\n",
    "        i = 0\n",
    "        for line in self.loadfile(path):\n",
    "            user, movie, rating, _ = line.split('::')\n",
    "            if i <= 10:\n",
    "                print('{},{},{},{}'.format(user, movie, rating, _))\n",
    "            i += 1\n",
    "            if random.random() < pivot:\n",
    "                self.train.setdefault(user, {})\n",
    "                self.train[user][movie] = int(rating)\n",
    "            else:\n",
    "                self.test.setdefault(user, {})\n",
    "                self.test[user][movie] = int(rating)\n",
    "\n",
    "\n",
    "    def ItemSimilarity(self):\n",
    "        #建立物品-物品的共现矩阵\n",
    "        C = dict()  #物品-物品的共现矩阵\n",
    "        N = dict()  #物品被多少个不同用户购买\n",
    "        for user,items in self.train.items():\n",
    "            for i in items.keys():\n",
    "                N.setdefault(i,0)\n",
    "                N[i] += 1\n",
    "                C.setdefault(i,{})\n",
    "                for j in items.keys():\n",
    "                    if i == j: \n",
    "                        continue\n",
    "                    C[i].setdefault(j,0)\n",
    "                    C[i][j] += 1\n",
    "        #计算相似度矩阵\n",
    "        self.W = dict()\n",
    "        for i,related_items in C.items():\n",
    "            self.W.setdefault(i,{})\n",
    "            for j,cij in related_items.items():\n",
    "                self.W[i][j] = cij / (math.sqrt(N[i] * N[j]))\n",
    "        return self.W, C, N\n",
    "\n",
    "    #给用户user推荐，前K个相关用户\n",
    "    def Recommend(self,u,K=3,N=10):\n",
    "        rank = dict()\n",
    "        action_item = self.train[u]     #用户u产生过行为的item和评分\n",
    "        for i,score in action_item.items():\n",
    "            # j：物品j\n",
    "            # wj：物品i和物品j的相似度\n",
    "            for j,wj in sorted(self.W[i].items(),key=lambda x:x[1],reverse=True)[0:K]:                \n",
    "                if j in action_item.keys():\n",
    "                    continue\n",
    "                rank.setdefault(j,0)\n",
    "                # 用户u对物品j感兴趣程度：用户对物品i的打分 * 物品i和物品j的相似度\n",
    "                rank[j] += score * wj\n",
    "        return dict(sorted(rank.items(),key=lambda x:x[1],reverse=True)[0:N])\n",
    "    \n",
    "    # 计算召回率和准确率\n",
    "    # 召回率 = 推荐的物品数 / 所有物品集合\n",
    "    # 准确率 = 推荐对的数量 / 推荐总数\n",
    "    def recallAndPrecision(self,k=8,nitem=10):\n",
    "        hit = 0\n",
    "        recall = 0\n",
    "        precision = 0\n",
    "        for user, items in self.test.items():\n",
    "            rank = self.Recommend(user, K=k, N=nitem)\n",
    "            hit += len(set(rank.keys()) & set(items.keys()))\n",
    "            recall += len(items)\n",
    "            precision += nitem\n",
    "        return (hit / (recall * 1.0),hit / (precision * 1.0))"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 48,
   "metadata": {
    "ExecuteTime": {
     "end_time": "2020-08-09T11:24:39.061874Z",
     "start_time": "2020-08-09T11:24:37.675925Z"
    }
   },
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "1,1193,5,978300760\n",
      "1,661,3,978302109\n",
      "1,914,3,978301968\n",
      "1,3408,4,978300275\n",
      "1,2355,5,978824291\n",
      "1,1197,3,978302268\n",
      "1,1287,5,978302039\n",
      "1,2804,5,978300719\n",
      "1,594,4,978302268\n",
      "1,919,4,978301368\n",
      "1,595,5,978824268\n"
     ]
    }
   ],
   "source": [
    "path = os.path.join('ml-1m', 'ratings.dat')\n",
    "icf = ItemBasedCF(path)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 49,
   "metadata": {
    "ExecuteTime": {
     "end_time": "2020-08-09T11:25:49.425206Z",
     "start_time": "2020-08-09T11:24:39.064226Z"
    }
   },
   "outputs": [],
   "source": [
    "# 计算物品\n",
    "i_W, i_C, i_N = icf.ItemSimilarity()"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 50,
   "metadata": {
    "ExecuteTime": {
     "end_time": "2020-08-09T11:25:54.987589Z",
     "start_time": "2020-08-09T11:25:49.430895Z"
    }
   },
   "outputs": [],
   "source": [
    "# 物品-物品共现矩阵\n",
    "df_ic = trans_dic_2_matrix(i_C)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 51,
   "metadata": {
    "ExecuteTime": {
     "end_time": "2020-08-09T11:25:55.009615Z",
     "start_time": "2020-08-09T11:25:54.989868Z"
    }
   },
   "outputs": [
    {
     "data": {
      "text/html": [
       "<div>\n",
       "<style scoped>\n",
       "    .dataframe tbody tr th:only-of-type {\n",
       "        vertical-align: middle;\n",
       "    }\n",
       "\n",
       "    .dataframe tbody tr th {\n",
       "        vertical-align: top;\n",
       "    }\n",
       "\n",
       "    .dataframe thead th {\n",
       "        text-align: right;\n",
       "    }\n",
       "</style>\n",
       "<table border=\"1\" class=\"dataframe\">\n",
       "  <thead>\n",
       "    <tr style=\"text-align: right;\">\n",
       "      <th></th>\n",
       "      <th>661</th>\n",
       "      <th>3408</th>\n",
       "      <th>1197</th>\n",
       "      <th>1287</th>\n",
       "      <th>594</th>\n",
       "      <th>919</th>\n",
       "      <th>595</th>\n",
       "      <th>2398</th>\n",
       "      <th>2918</th>\n",
       "      <th>1035</th>\n",
       "    </tr>\n",
       "  </thead>\n",
       "  <tbody>\n",
       "    <tr>\n",
       "      <th>1193</th>\n",
       "      <td>101.0</td>\n",
       "      <td>226.0</td>\n",
       "      <td>415.0</td>\n",
       "      <td>182.0</td>\n",
       "      <td>177.0</td>\n",
       "      <td>427.0</td>\n",
       "      <td>209.0</td>\n",
       "      <td>122.0</td>\n",
       "      <td>303.0</td>\n",
       "      <td>191.0</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>661</th>\n",
       "      <td>0.0</td>\n",
       "      <td>77.0</td>\n",
       "      <td>173.0</td>\n",
       "      <td>53.0</td>\n",
       "      <td>130.0</td>\n",
       "      <td>153.0</td>\n",
       "      <td>161.0</td>\n",
       "      <td>46.0</td>\n",
       "      <td>104.0</td>\n",
       "      <td>124.0</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>3408</th>\n",
       "      <td>77.0</td>\n",
       "      <td>0.0</td>\n",
       "      <td>275.0</td>\n",
       "      <td>94.0</td>\n",
       "      <td>110.0</td>\n",
       "      <td>219.0</td>\n",
       "      <td>148.0</td>\n",
       "      <td>61.0</td>\n",
       "      <td>202.0</td>\n",
       "      <td>161.0</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>1197</th>\n",
       "      <td>173.0</td>\n",
       "      <td>275.0</td>\n",
       "      <td>0.0</td>\n",
       "      <td>234.0</td>\n",
       "      <td>275.0</td>\n",
       "      <td>540.0</td>\n",
       "      <td>336.0</td>\n",
       "      <td>129.0</td>\n",
       "      <td>539.0</td>\n",
       "      <td>260.0</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>1287</th>\n",
       "      <td>53.0</td>\n",
       "      <td>94.0</td>\n",
       "      <td>234.0</td>\n",
       "      <td>0.0</td>\n",
       "      <td>120.0</td>\n",
       "      <td>222.0</td>\n",
       "      <td>139.0</td>\n",
       "      <td>83.0</td>\n",
       "      <td>138.0</td>\n",
       "      <td>112.0</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>594</th>\n",
       "      <td>130.0</td>\n",
       "      <td>110.0</td>\n",
       "      <td>275.0</td>\n",
       "      <td>120.0</td>\n",
       "      <td>0.0</td>\n",
       "      <td>312.0</td>\n",
       "      <td>271.0</td>\n",
       "      <td>101.0</td>\n",
       "      <td>203.0</td>\n",
       "      <td>189.0</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>919</th>\n",
       "      <td>153.0</td>\n",
       "      <td>219.0</td>\n",
       "      <td>540.0</td>\n",
       "      <td>222.0</td>\n",
       "      <td>312.0</td>\n",
       "      <td>0.0</td>\n",
       "      <td>326.0</td>\n",
       "      <td>143.0</td>\n",
       "      <td>358.0</td>\n",
       "      <td>285.0</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>595</th>\n",
       "      <td>161.0</td>\n",
       "      <td>148.0</td>\n",
       "      <td>336.0</td>\n",
       "      <td>139.0</td>\n",
       "      <td>271.0</td>\n",
       "      <td>326.0</td>\n",
       "      <td>0.0</td>\n",
       "      <td>93.0</td>\n",
       "      <td>233.0</td>\n",
       "      <td>190.0</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>2398</th>\n",
       "      <td>46.0</td>\n",
       "      <td>61.0</td>\n",
       "      <td>129.0</td>\n",
       "      <td>83.0</td>\n",
       "      <td>101.0</td>\n",
       "      <td>143.0</td>\n",
       "      <td>93.0</td>\n",
       "      <td>0.0</td>\n",
       "      <td>111.0</td>\n",
       "      <td>95.0</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>2918</th>\n",
       "      <td>104.0</td>\n",
       "      <td>202.0</td>\n",
       "      <td>539.0</td>\n",
       "      <td>138.0</td>\n",
       "      <td>203.0</td>\n",
       "      <td>358.0</td>\n",
       "      <td>233.0</td>\n",
       "      <td>111.0</td>\n",
       "      <td>0.0</td>\n",
       "      <td>199.0</td>\n",
       "    </tr>\n",
       "  </tbody>\n",
       "</table>\n",
       "</div>"
      ],
      "text/plain": [
       "        661   3408   1197   1287    594    919    595   2398   2918   1035\n",
       "1193  101.0  226.0  415.0  182.0  177.0  427.0  209.0  122.0  303.0  191.0\n",
       "661     0.0   77.0  173.0   53.0  130.0  153.0  161.0   46.0  104.0  124.0\n",
       "3408   77.0    0.0  275.0   94.0  110.0  219.0  148.0   61.0  202.0  161.0\n",
       "1197  173.0  275.0    0.0  234.0  275.0  540.0  336.0  129.0  539.0  260.0\n",
       "1287   53.0   94.0  234.0    0.0  120.0  222.0  139.0   83.0  138.0  112.0\n",
       "594   130.0  110.0  275.0  120.0    0.0  312.0  271.0  101.0  203.0  189.0\n",
       "919   153.0  219.0  540.0  222.0  312.0    0.0  326.0  143.0  358.0  285.0\n",
       "595   161.0  148.0  336.0  139.0  271.0  326.0    0.0   93.0  233.0  190.0\n",
       "2398   46.0   61.0  129.0   83.0  101.0  143.0   93.0    0.0  111.0   95.0\n",
       "2918  104.0  202.0  539.0  138.0  203.0  358.0  233.0  111.0    0.0  199.0"
      ]
     },
     "execution_count": 51,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "df_ic.iloc[:10,:10]"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 52,
   "metadata": {
    "ExecuteTime": {
     "end_time": "2020-08-09T11:25:55.018532Z",
     "start_time": "2020-08-09T11:25:55.012775Z"
    }
   },
   "outputs": [
    {
     "data": {
      "text/plain": [
       "(3662, 3662)"
      ]
     },
     "execution_count": 52,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "df_ic.shape"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 53,
   "metadata": {
    "ExecuteTime": {
     "end_time": "2020-08-09T11:25:55.024006Z",
     "start_time": "2020-08-09T11:25:55.021314Z"
    }
   },
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "1193 1196\n",
      "661 370\n",
      "3408 895\n"
     ]
    }
   ],
   "source": [
    "# 物品被多少个不同用户购买\n",
    "print_1_dim_dic(i_N)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 54,
   "metadata": {
    "ExecuteTime": {
     "end_time": "2020-08-09T11:25:59.440397Z",
     "start_time": "2020-08-09T11:25:55.027208Z"
    }
   },
   "outputs": [],
   "source": [
    "# 物品和物品相似度矩阵\n",
    "df_iw = trans_dic_2_matrix(i_W)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 55,
   "metadata": {
    "ExecuteTime": {
     "end_time": "2020-08-09T11:25:59.448090Z",
     "start_time": "2020-08-09T11:25:59.443095Z"
    }
   },
   "outputs": [
    {
     "data": {
      "text/plain": [
       "(3662, 3662)"
      ]
     },
     "execution_count": 55,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "df_iw.shape"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 56,
   "metadata": {
    "ExecuteTime": {
     "end_time": "2020-08-09T11:25:59.477475Z",
     "start_time": "2020-08-09T11:25:59.451452Z"
    }
   },
   "outputs": [
    {
     "data": {
      "text/html": [
       "<div>\n",
       "<style scoped>\n",
       "    .dataframe tbody tr th:only-of-type {\n",
       "        vertical-align: middle;\n",
       "    }\n",
       "\n",
       "    .dataframe tbody tr th {\n",
       "        vertical-align: top;\n",
       "    }\n",
       "\n",
       "    .dataframe thead th {\n",
       "        text-align: right;\n",
       "    }\n",
       "</style>\n",
       "<table border=\"1\" class=\"dataframe\">\n",
       "  <thead>\n",
       "    <tr style=\"text-align: right;\">\n",
       "      <th></th>\n",
       "      <th>661</th>\n",
       "      <th>3408</th>\n",
       "      <th>1197</th>\n",
       "      <th>1287</th>\n",
       "      <th>594</th>\n",
       "      <th>919</th>\n",
       "      <th>595</th>\n",
       "      <th>2398</th>\n",
       "      <th>2918</th>\n",
       "      <th>1035</th>\n",
       "    </tr>\n",
       "  </thead>\n",
       "  <tbody>\n",
       "    <tr>\n",
       "      <th>1193</th>\n",
       "      <td>0.151829</td>\n",
       "      <td>0.218440</td>\n",
       "      <td>0.292423</td>\n",
       "      <td>0.238719</td>\n",
       "      <td>0.220044</td>\n",
       "      <td>0.353784</td>\n",
       "      <td>0.220968</td>\n",
       "      <td>0.217116</td>\n",
       "      <td>0.275414</td>\n",
       "      <td>0.224168</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>661</th>\n",
       "      <td>0.000000</td>\n",
       "      <td>0.133807</td>\n",
       "      <td>0.219167</td>\n",
       "      <td>0.124985</td>\n",
       "      <td>0.290565</td>\n",
       "      <td>0.227912</td>\n",
       "      <td>0.306037</td>\n",
       "      <td>0.147182</td>\n",
       "      <td>0.169958</td>\n",
       "      <td>0.261653</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>3408</th>\n",
       "      <td>0.133807</td>\n",
       "      <td>0.000000</td>\n",
       "      <td>0.224001</td>\n",
       "      <td>0.142527</td>\n",
       "      <td>0.158082</td>\n",
       "      <td>0.209753</td>\n",
       "      <td>0.180884</td>\n",
       "      <td>0.125492</td>\n",
       "      <td>0.212251</td>\n",
       "      <td>0.218434</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>1197</th>\n",
       "      <td>0.219167</td>\n",
       "      <td>0.224001</td>\n",
       "      <td>0.000000</td>\n",
       "      <td>0.258658</td>\n",
       "      <td>0.288113</td>\n",
       "      <td>0.377050</td>\n",
       "      <td>0.299376</td>\n",
       "      <td>0.193471</td>\n",
       "      <td>0.412883</td>\n",
       "      <td>0.257163</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>1287</th>\n",
       "      <td>0.124985</td>\n",
       "      <td>0.142527</td>\n",
       "      <td>0.258658</td>\n",
       "      <td>0.000000</td>\n",
       "      <td>0.234026</td>\n",
       "      <td>0.288543</td>\n",
       "      <td>0.230540</td>\n",
       "      <td>0.231717</td>\n",
       "      <td>0.196775</td>\n",
       "      <td>0.206208</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>594</th>\n",
       "      <td>0.290565</td>\n",
       "      <td>0.158082</td>\n",
       "      <td>0.288113</td>\n",
       "      <td>0.234026</td>\n",
       "      <td>0.000000</td>\n",
       "      <td>0.384355</td>\n",
       "      <td>0.426010</td>\n",
       "      <td>0.267252</td>\n",
       "      <td>0.274351</td>\n",
       "      <td>0.329814</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>919</th>\n",
       "      <td>0.227912</td>\n",
       "      <td>0.209753</td>\n",
       "      <td>0.377050</td>\n",
       "      <td>0.288543</td>\n",
       "      <td>0.384355</td>\n",
       "      <td>0.000000</td>\n",
       "      <td>0.341541</td>\n",
       "      <td>0.252180</td>\n",
       "      <td>0.322455</td>\n",
       "      <td>0.331457</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>595</th>\n",
       "      <td>0.306037</td>\n",
       "      <td>0.180884</td>\n",
       "      <td>0.299376</td>\n",
       "      <td>0.230540</td>\n",
       "      <td>0.426010</td>\n",
       "      <td>0.341541</td>\n",
       "      <td>0.000000</td>\n",
       "      <td>0.209281</td>\n",
       "      <td>0.267803</td>\n",
       "      <td>0.281974</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>2398</th>\n",
       "      <td>0.147182</td>\n",
       "      <td>0.125492</td>\n",
       "      <td>0.193471</td>\n",
       "      <td>0.231717</td>\n",
       "      <td>0.267252</td>\n",
       "      <td>0.252180</td>\n",
       "      <td>0.209281</td>\n",
       "      <td>0.000000</td>\n",
       "      <td>0.214749</td>\n",
       "      <td>0.237316</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>2918</th>\n",
       "      <td>0.169958</td>\n",
       "      <td>0.212251</td>\n",
       "      <td>0.412883</td>\n",
       "      <td>0.196775</td>\n",
       "      <td>0.274351</td>\n",
       "      <td>0.322455</td>\n",
       "      <td>0.267803</td>\n",
       "      <td>0.214749</td>\n",
       "      <td>0.000000</td>\n",
       "      <td>0.253903</td>\n",
       "    </tr>\n",
       "  </tbody>\n",
       "</table>\n",
       "</div>"
      ],
      "text/plain": [
       "           661      3408      1197      1287       594       919       595  \\\n",
       "1193  0.151829  0.218440  0.292423  0.238719  0.220044  0.353784  0.220968   \n",
       "661   0.000000  0.133807  0.219167  0.124985  0.290565  0.227912  0.306037   \n",
       "3408  0.133807  0.000000  0.224001  0.142527  0.158082  0.209753  0.180884   \n",
       "1197  0.219167  0.224001  0.000000  0.258658  0.288113  0.377050  0.299376   \n",
       "1287  0.124985  0.142527  0.258658  0.000000  0.234026  0.288543  0.230540   \n",
       "594   0.290565  0.158082  0.288113  0.234026  0.000000  0.384355  0.426010   \n",
       "919   0.227912  0.209753  0.377050  0.288543  0.384355  0.000000  0.341541   \n",
       "595   0.306037  0.180884  0.299376  0.230540  0.426010  0.341541  0.000000   \n",
       "2398  0.147182  0.125492  0.193471  0.231717  0.267252  0.252180  0.209281   \n",
       "2918  0.169958  0.212251  0.412883  0.196775  0.274351  0.322455  0.267803   \n",
       "\n",
       "          2398      2918      1035  \n",
       "1193  0.217116  0.275414  0.224168  \n",
       "661   0.147182  0.169958  0.261653  \n",
       "3408  0.125492  0.212251  0.218434  \n",
       "1197  0.193471  0.412883  0.257163  \n",
       "1287  0.231717  0.196775  0.206208  \n",
       "594   0.267252  0.274351  0.329814  \n",
       "919   0.252180  0.322455  0.331457  \n",
       "595   0.209281  0.267803  0.281974  \n",
       "2398  0.000000  0.214749  0.237316  \n",
       "2918  0.214749  0.000000  0.253903  "
      ]
     },
     "execution_count": 56,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "df_iw.iloc[:10, :10]"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 57,
   "metadata": {
    "ExecuteTime": {
     "end_time": "2020-08-09T11:25:59.838704Z",
     "start_time": "2020-08-09T11:25:59.480843Z"
    }
   },
   "outputs": [],
   "source": [
    "recomend = icf.Recommend('520')"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 58,
   "metadata": {
    "ExecuteTime": {
     "end_time": "2020-08-09T11:25:59.844934Z",
     "start_time": "2020-08-09T11:25:59.840467Z"
    }
   },
   "outputs": [
    {
     "data": {
      "text/plain": [
       "{'357': 12.211174702319198,\n",
       " '1196': 11.311378182752293,\n",
       " '594': 10.112069204301593,\n",
       " '1997': 9.126319045056363,\n",
       " '2997': 8.89347728210522,\n",
       " '2424': 8.728050947383766,\n",
       " '1350': 8.557122538227246,\n",
       " '1079': 7.909900119597647,\n",
       " '904': 7.521405968310045,\n",
       " '2145': 7.476983443692152}"
      ]
     },
     "execution_count": 58,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "recomend"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 59,
   "metadata": {
    "ExecuteTime": {
     "end_time": "2020-08-09T11:37:03.753252Z",
     "start_time": "2020-08-09T11:25:59.847317Z"
    }
   },
   "outputs": [
    {
     "data": {
      "text/plain": [
       "(0.07688866483940214, 0.38130794701986753)"
      ]
     },
     "execution_count": 59,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "icf.recallAndPrecision()"
   ]
  },
  {
   "attachments": {},
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## 物品相似度的归一化\n",
    "ItemCF相似度矩阵按最大值归一化，可以提高推荐准确率。\n",
    "\n",
    "<div align=center><font size=5 width=\"50%\" height=\"50%\">$w_{ij}=\\frac{w_{ij}}{max_{j}w_{ij}}$</font></div>\n",
    "\n",
    "案例：电影网站中，ItemCF计算出来的相似度一般是纪录片和纪录片的相似度更近，动画片和动画片的相似度更近，而动画片和纪录片的相似度却不一定相同。\n",
    "\n",
    "假设物品有纪录片和动画片两类，纪录片之间的相似度为0.5，动画片之间的相似度为0.6，而纪录片和动画片之间的相似度是0.2。\n",
    "\n",
    "在这种情况下，如果用户喜欢了5个纪录片和5个动画片，用ItemCF进行推荐，推荐的就都是动画片，因为动画片之间相似度大。\n",
    "\n",
    "但如果归一化之后，纪录片之间的相似度就变成了1，动画片的相似度有变成了1，在这种情况下，用户如果喜欢了5个纪录片和5个动画片，那么他的推荐列表中，纪录片和动画片的数目也应该是大致相等的。\n",
    "\n",
    "一般来说，热门的类的物品中相似度一般已经大，用户对热门物品的行为也比较多，如果不进行归一化，就会推荐比较热门的类里面的物品。\n",
    "\n",
    "![image.png](ItemCF和ItemCF-Norm对比.png)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# 工业界推荐系统中常见的问题\n",
    "\n",
    "## 工业界，协同召回的流程\n",
    "1. 数据处理\n",
    "    - 对行为少的不活跃的用户进行过滤，行为少的用户，数据太过于稀疏，召回难度大\n",
    "    - 对用户中热门物品进行过滤，热门物品可能大部分用户都有过行为\n",
    "    - 非常活跃的用户，用户协同可能会出现一种情况，就是每个用户的topN相似用户里都有这些非常活跃的用户，所以需要适当过滤掉这些用\n",
    "2. 建立用户embedding和物品embedding，或者可以像案例这样，直接建立共现矩阵，也可以训练embedding\n",
    "3. 计算用户和N个用户的相似度，保存N个相似用户曾经看过的TopK个物品\n",
    "4. 模型(矩阵)进行定期更新（1、这个根据不同项目组的情况，可能是一天更新一次，也可能不是，看具体的情况； 2、更新的时候使用前N天(N一般可以为3-10天)的较活跃用户的数据进行更新）\n",
    "5. 每次召回一次N条，刷完N条就再继续召回\n",
    "    - 还有可能用户两次行为(上拉/下滑)之间间隔很长时间，也会进行重新召回\n",
    "    - 每次召回的数量，需要根据召回通道数以及各个召回通道配置的召回占比进行配置\n",
    "6. 为了保证用户不疲劳，一般情况下，利用user-cf计算召回结果后，会做一定的类别的去重，保证召回覆盖度。\n",
    "7. 实际过程中，根据公司核心用户的数据量大小，考虑实现工具，若数据量较大，可使用spark进行用户协同的结果计算\n",
    "8. 如果用户量实在太过巨大，可考虑使用稀疏存储的方式进行存储，即只存储含有1(或者其他值)的位置坐标索引index以及对应的值\n",
    "\n",
    "## 用户行为大多是隐性的\n",
    "\n",
    "用户的行为大部分都不会直接表现出来，以新闻类网站为例，用户阅读一篇文章，都不能很明确的表现出用户喜欢这类物品，有时候需要综合用户的点击，曝光，收藏，分享，阅读时长，评论等行为，做一个加权，然后再归一化，最终计算出一个合理的值来表示用户的喜好，往往这类行为不容易收集，而且也是比较稀疏的，需要积累很长时间，才能积累到足够的数据\n",
    "\n",
    "## 用户协同和物品协同的使用场景\n",
    "\n",
    "UserCF给用户推荐那些和他兴趣相投的用户喜欢的物品，而ItemCF是给用户推荐那些和他之前喜欢的物品类似的物品。\n",
    "\n",
    "从算法原理来看，UserCF推荐结果着重于反映和用户兴趣相似的小群体的热点，而ItemCF的推荐结果着重于维系用户的历史兴趣。UserCF的推荐更社会化，反映了用户所在的小型兴趣群体中物品的热门程度。ItemCF的推荐更加个性化，反映了用户自己的兴趣传承。\n",
    "\n",
    "### UserCF使用场景\n",
    "\n",
    "**新闻网站**\n",
    "1. 用户的兴趣不是很细化，绝大多数用户都喜欢看热点新闻，个性化推荐需求不强，比如有些用户喜欢看体育新闻，有些喜欢看社会新闻，特别细粒度的兴趣一般不存在。所以，新闻推荐，更强调抓住新闻热点，热门程度和时效性是个性化新闻推荐的重点，个性化需求相对比这两点略次要。\n",
    "\n",
    "2. 技术角度来看，新闻类物品更新非常快，每时每刻都有新内容出现，而ItemCF需要维护一张物品-物品矩阵，有物品更新，这个表也需要更新，更新比较频繁，从技术和存储上来说，都比较困难。绝大多数物品相关度表都只能做到一天更新一次，在新闻领域这是不可接受的。而UserCF只需要维护用户相似性表，虽然用户对于新用户也需要更新相似度表，但在新闻网站中，物品更新速度远快于新用户加入速度。而且对于新用户，可以给他推热门的新闻，或者用冷启动策略去实现。\n",
    "\n",
    "### ItemCF使用场景\n",
    "\n",
    "**图书，电子商务，电影网站**\n",
    "1. 这些网站中，用户的兴趣往往比较固定和持久的。比如，一个技术人员可能都是在购买技术方面的书籍，而他对书的热门程度并不是那么敏感。\n",
    "2. 这些网站中个性化推荐的任务就是帮用户发现和他研究领域相关的物品。而且这些网站的物品更新不会特别快，一天更新一次物品相似度矩阵对他们来说不会造成太大的损失，用户是可以接受的。\n",
    "\n",
    "## 协同过滤的改进\n",
    "![image.png](协同过滤的改进.png)\n",
    "### 基础算法\n",
    "\n",
    "图1为最简单的计算物品相关度的公式，分子为同时喜好 item-i & item-j 的用户数。\n",
    "\t\n",
    "### 对热门物品惩罚\n",
    "\n",
    "但是图1存在一个问题，如果 item-j 是很热门的商品，导致很多喜欢 item-i 的用户都喜欢 item-j，这时 wij 就会非常大。同样，几乎所有的物品都和 item-j 的相关度非常高，这显然是不合理的。\n",
    "\n",
    "图2中分母通过引入 N(j) 来对 item-j 的热度进行惩罚。\n",
    "\n",
    "### 对热门物品进一步惩罚\n",
    "\n",
    "如果 item-j 极度热门，上面的算法还是不够的。举个例子，《Harry Potter》非常火，买任何一本书的人都会购买它，即使通过图2的方法对它进行了惩罚，但是《Harry Potter》仍然会获得很高的相似度。这就是推荐系统领域著名的 Harry Potter Problem。\n",
    "\n",
    "图1的方式，即便引入了N(j)，因为j非常热门，所以$|N(i) \\cap N(j)|$就会越来越接近N(i)。即便上面的公式已经考虑到了j的流行度，但在实际应用中，热门的j仍然会获得比较大的相似度。\n",
    "\n",
    "如果需要进一步对热门物品惩罚，可以继续修改公式为如图3所示，在分母上加大对热门物品的惩罚。通过调节参数 α ($α \\in [0.5, 1]$)，α 越大，惩罚力度越大，热门物品的相似度越低，整体结果的平均热门程度越低。\n",
    "\n",
    "![image.png](惩罚流行度后的ItemCF推荐效果对比.png)\n",
    "\n",
    "对比不同的α惩罚热门物品后，ItemCF算法的推荐性能。\n",
    "\n",
    "当α=0.5时，就是标准的ItemCF算法，从离线实验结果来看，α只有在取值为0.5时，才会导致最高的准确率和召回率，而物品α<0.5还是α>0.5都不会带来这两个指标的提高。但是看覆盖率和平均流行度就可以发现，α越大，覆盖率就越高，并且结果的平均热门程度会降低。\n",
    "\n",
    "因此，通过这种方式可以适当牺牲准确率和召回率，显著提升结果的覆盖率和新颖性（降低流行度提高新颖性）。\n",
    "\n",
    "### 对活跃用户惩罚\n",
    "\n",
    "同样的，Item-based CF 也需要考虑活跃用户（即一个活跃用户（专门做刷单）可能买了非常多的物品）的影响，活跃用户对物品相似度的贡献应该小于不活跃用户。图4为集合了该权重的算法。\n",
    "\n",
    "## 推荐系统中常用的评估指标\n",
    "目前工业界还没有什么好的评估指标去衡量一个推荐系统。当然，在线下训练的时候，比如协同过滤，你可以像案例这样，把数据集分成训练集和测试集，数据需要做一些处理，比如，训练集是前面6天的数据，测试集是后面一天的数据，这么分的目的：不能在测试集看到用户在未来时间点发生的行为，避免时间泄露。可以线下大概评估一下协同效果。但是，用户兴趣往往是多变的，用户的行为可能这段时间喜欢娱乐，科技，可能后面一段时间喜欢财经，美女。所以最后，模型的效果，还是要根据线上用户的实际情况去评判，这也只是个参考值。\n",
    "\n",
    "所以，现有的评估指标很难去衡量一个模型或者召回结果的好坏，只能通过看用户的行为数据，推荐的物品，用户点击的多，或者分享，评论，收藏等，通过数据表现，召回能力，曝光和点击数量，推荐物品覆盖度等一些指标，去判断召回的好坏。这些一般要根据公司实际业务，更侧重什么，去调整模型和算法。\n",
    "\n",
    "比如更侧重曝光和点击，侧重用户体验，停留时长，召回多样性等。\n",
    "\n",
    "## 协同过滤的优缺点\n",
    "### 协同过滤优点\n",
    "\n",
    "协同推荐是应用最广泛的推荐算法。因为基于内容推荐的算法，需要给物品打上标签，给用户建用户画像，才能实现匹配推荐。相比之下，协同过滤简单了许多。它是仅使用用户行为的进行推荐，我们不需要对物品或信息进行完整的标签化分析，避免了一些人可能难以量化描述的概念的标签构建，又可以很好地发现用户的潜在兴趣偏好。\n",
    "    \n",
    "\n",
    "### 协同过滤缺点\n",
    "\n",
    "因为协同过滤依赖用户的历史数据，面对新的用户或者新的物品，在开始的时候没有数据或数据较少时，协同过滤算法无法做出推荐。需要等数据积累，或者其他方案进行弥补缺陷，也就是常说的冷启动的问题。\n",
    "\n",
    "机器学习领域，当精确的方式不行难以计算或者速度太慢的时候，往往会选择牺牲一点精度，达到差不多但非常快速的效果。SVD就是其中的一个例子。\n",
    "\n",
    "没有完美的算法，只有最合适的算法。现在的实践，也不是单纯用协同过滤来做推荐，而是将他们作为其中的一个或几个召回策略来使用。"
   ]
  }
 ],
 "metadata": {
  "kernelspec": {
   "display_name": "Python 3",
   "language": "python",
   "name": "python3"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
   "version": "3.7.6"
  },
  "toc": {
   "base_numbering": 1,
   "nav_menu": {},
   "number_sections": true,
   "sideBar": true,
   "skip_h1_title": false,
   "title_cell": "Table of Contents",
   "title_sidebar": "Contents",
   "toc_cell": false,
   "toc_position": {
    "height": "calc(100% - 180px)",
    "left": "10px",
    "top": "150px",
    "width": "238px"
   },
   "toc_section_display": true,
   "toc_window_display": true
  }
 },
 "nbformat": 4,
 "nbformat_minor": 4
}
